VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning
Authors
Abstract
It is highly desirable yet challenging to generate image captions that can describe novel objects which are unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this challenge, no additional image-caption training data, other than COCO Captions, is allowed for model training. Thus, conventional Vision-Language Pre-training (VLP) methods cannot be applied. This paper presents VIsual VOcabulary pre-training (VIVO) that performs pre-training in the absence of caption annotations. By breaking the dependency on paired image-caption data in VLP, VIVO can leverage large amounts of paired image-tag data to learn a visual vocabulary. This is done by pre-training a multi-layer Transformer model that learns to align image-level tags with their corresponding image region features. To address the unordered nature of image tags, VIVO uses a Hungarian matching loss with masked tag prediction to conduct pre-training. We validate the effectiveness of VIVO by fine-tuning the pre-trained model for image captioning. In addition, we perform an analysis of the visual-text alignment inferred by our model. The results show that our model can not only generate fluent image captions describing novel objects, but also identify the locations of these objects. Our single model has achieved a new state-of-the-art result on nocaps and surpassed the human CIDEr score.
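The Hungarian matching loss mentioned in the abstract handles the fact that image tags form an unordered set, so predictions cannot simply be compared position by position. A minimal sketch of the idea, assuming classification logits over a tag vocabulary (names, shapes, and the NLL cost are illustrative assumptions, not the authors' implementation):

```python
# Sketch of a Hungarian matching loss for unordered tag prediction,
# in the spirit of VIVO. The cost matrix and shapes are assumptions
# for illustration, not the paper's exact formulation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_matching_loss(pred_logits, target_ids):
    """pred_logits: (num_preds, vocab_size) tag-classification scores.
    target_ids: (num_targets,) ground-truth tag indices, order-free.
    Returns the mean negative log-likelihood under the optimal
    one-to-one assignment between predictions and target tags."""
    # Softmax over the tag vocabulary.
    exp = np.exp(pred_logits - pred_logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Cost of assigning prediction i to target j: -log p_i(tag_j).
    cost = -np.log(probs[:, target_ids] + 1e-9)  # (num_preds, num_targets)
    rows, cols = linear_sum_assignment(cost)     # optimal bipartite matching
    return cost[rows, cols].mean()
```

Because the loss is computed over the optimal assignment, permuting the target tags leaves it unchanged, which is exactly the property an unordered tag set requires.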
Similar resources
Supplemental Material Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data
We present further empirical and qualitative results for both image and video description. For the image description task, we explore averaging weight vectors before transfer, illustrate errors made by the model when no unpaired text data is used during training and provide descriptions generated by DCC for a large variety of novel object categories in ImageNet. For the video description task, ...
Oracle Performance for Visual Captioning
The task of associating images and videos with a natural language description has attracted a great amount of attention recently. The state-of-the-art results on some of the standard datasets have been pushed into the regime where it has become more and more difficult to make significant improvements. Instead of proposing new models, this work investigates performances that an oracle can obtain...
Concept-Specific Visual Vocabulary Construction for Object Categorization
Recently, the bag-of-words (BOW) based image representation has become popular in object categorization. However, no visual vocabulary is available in advance, so it has to be learned. In traditional learning methods, the vocabulary is constructed by exploring only one type of feature, or by simply concatenating all kinds of visual features into a long vector. Such constructions neglect the distinct role...
Visual Vocabulary Signature for 3D Object Retrieval and Partial Matching
In this paper a novel object signature is proposed for 3D object retrieval and partial matching. A part-based representation is obtained by partitioning the objects into subparts and by characterizing each segment with different geometric descriptors. Therefore, a Bag of Words framework is introduced by clustering properly such descriptors in order to define the so called 3D visual vocabulary. ...
Actor-Critic Sequence Training for Image Captioning
Generating natural language descriptions of images is an important capability for a robot or other visual-intelligence driven AI agent that may need to communicate with human users about what it is seeing. Such image captioning methods are typically trained by maximising the likelihood of the ground-truth annotated caption given the image. While simple and easy to implement, this approach does not ...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2021
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v35i2.16249